Building a song recommender

Dataset used: Million Song Dataset
Source: http://labrosa.ee.columbia.edu/millionsong/
Paper: http://ismir2011.ismir.net/papers/OS6-1.pdf
The current notebook uses a subset of the above data containing 10,000 songs, obtained from: https://github.com/turi-code/tutorials/blob/master/notebooks/recsys_rank_10K_song.ipynb

In [1]:
%matplotlib inline

import pandas
from sklearn.model_selection import train_test_split  #older scikit-learn versions exposed this as sklearn.cross_validation
import numpy as np
import time
import joblib  #older scikit-learn versions bundled this as sklearn.externals.joblib
import Recommenders as Recommenders
import Evaluation as Evaluation

Load music data


In [2]:
#Read userid-songid-listen_count triplets
#This step might take time to download data from external sources
triplets_file = 'https://static.turi.com/datasets/millionsong/10000.txt'
songs_metadata_file = 'https://static.turi.com/datasets/millionsong/song_data.csv'

song_df_1 = pandas.read_table(triplets_file,header=None)
song_df_1.columns = ['user_id', 'song_id', 'listen_count']

#Read song  metadata
song_df_2 =  pandas.read_csv(songs_metadata_file)

#Merge the two dataframes above to create input dataframe for recommender systems
song_df = pandas.merge(song_df_1, song_df_2.drop_duplicates(['song_id']), on="song_id", how="left")

Explore data

The music data shows how many times each user listened to a song, as well as details of the song.


In [3]:
song_df.head()


Out[3]:
user_id song_id listen_count title release artist_name year
0 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 1 The Cove Thicker Than Water Jack Johnson 0
1 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBBMDR12A8C13253B 2 Entre Dos Aguas Flamenco Para Niños Paco De Lucia 1976
2 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1 Stronger Graduation Kanye West 2007
3 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBYHAJ12A6701BF1D 1 Constellations In Between Dreams Jack Johnson 2005
4 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODACBL12A8C13C273 1 Learn To Fly There Is Nothing Left To Lose Foo Fighters 1999

Length of the dataset


In [4]:
len(song_df)


Out[4]:
2000000

Create a subset of the dataset


In [5]:
#Work with a 10,000-row subset; copy() avoids pandas' SettingWithCopyWarning below
song_df = song_df.head(10000).copy()

#Merge the song title and artist_name columns into a single 'song' column
song_df['song'] = song_df['title'].map(str) + " - " + song_df['artist_name']

In [6]:
song_grouped = song_df.groupby(['song']).agg({'listen_count': 'count'}).reset_index()
grouped_sum = song_grouped['listen_count'].sum()
song_grouped['percentage']  = song_grouped['listen_count'].div(grouped_sum)*100
song_grouped.sort_values(['listen_count', 'song'], ascending = [0,1])


Out[6]:
song listen_count percentage
3660 Sehr kosmisch - Harmonia 45 0.45
4678 Undo - Björk 32 0.32
5105 You're The One - Dwight Yoakam 32 0.32
1071 Dog Days Are Over (Radio Edit) - Florence + Th... 28 0.28
3655 Secrets - OneRepublic 28 0.28
4378 The Scientist - Coldplay 27 0.27
4712 Use Somebody - Kings Of Leon 27 0.27
3476 Revelry - Kings Of Leon 26 0.26
1387 Fireflies - Charttraxx Karaoke 24 0.24
1862 Horn Concerto No. 4 in E flat K495: II. Romanc... 23 0.23
1805 Hey_ Soul Sister - Train 22 0.22
5032 Yellow - Coldplay 22 0.22
808 Clocks - Coldplay 21 0.21
2620 Lucky (Album Version) - Jason Mraz & Colbie Ca... 20 0.20
2299 Just Dance - Lady GaGa / Colby O'Donis 19 0.19
456 Billionaire [feat. Bruno Mars] (Explicit Albu... 18 0.18
2689 Marry Me - Train 18 0.18
3064 OMG - Usher featuring will.i.am 18 0.18
4543 Tive Sim - Cartola 18 0.18
142 Alejandro - Lady GaGa 17 0.17
726 Catch You Baby (Steve Pitron & Max Sanna Radio... 17 0.17
1410 Float On - Modest Mouse 17 0.17
3868 Somebody To Love - Justin Bieber 17 0.17
631 Bulletproof - La Roux 16 0.16
1143 Drop The World - Lil Wayne / Eminem 16 0.16
3038 Nothin' On You [feat. Bruno Mars] (Album Versi... 16 0.16
4465 They Might Follow You - Tiny Vipers 16 0.16
870 Cosmic Love - Florence + The Machine 15 0.15
899 Creep (Explicit) - Radiohead 15 0.15
1680 Halo - Beyoncé 15 0.15
... ... ... ...
5094 You Yourself are Too Serious - The Mercury Pro... 1 0.01
5098 You'll Never Know (My Love) (Bovellian 07 Mix)... 1 0.01
5100 You're A Wolf (Album) - Sea Wolf 1 0.01
5102 You're Gonna Miss Me When I'm Gone - Brooks & ... 1 0.01
5103 You're Not Alone - ATB 1 0.01
5104 You're Not Alone - Olive 1 0.01
5108 You've Passed - Neutral Milk Hotel 1 0.01
5109 Young - Hollywood Undead 1 0.01
5111 Younger Than Springtime - William Tabbert 1 0.01
5112 Your Arms Feel Like home - 3 Doors Down 1 0.01
5113 Your Every Idol - Telefon Tel Aviv 1 0.01
5114 Your Ex-Lover Is Dead (Album Version) - Stars 1 0.01
5115 Your Guardian Angel - The Red Jumpsuit Apparatus 1 0.01
5117 Your House - Jimmy Eat World 1 0.01
5118 Your Love - The Outfield 1 0.01
5121 Your Mouth - Telefon Tel Aviv 1 0.01
5123 Your Song (Alternate Take 10) - Cilla Black 1 0.01
5126 Your Visits Are Getting Shorter - Bloc Party 1 0.01
5127 Your Woman - White Town 1 0.01
5130 Ze Rook Naar Rozen - Rob De Nijs 1 0.01
5131 Zebra - Beach House 1 0.01
5132 Zebra - Man Man 1 0.01
5133 Zero - The Pain Machinery 1 0.01
5135 Zopf: Pigtail - Penguin Café Orchestra 1 0.01
5137 aNYway - Armand Van Helden & A-TRAK Present Du... 1 0.01
5139 high fives - Four Tet 1 0.01
5140 in white rooms - Booka Shade 1 0.01
5143 paranoid android - Christopher O'Riley 1 0.01
5149 ¿Lo Ves? [Piano Y Voz] - Alejandro Sanz 1 0.01
5150 Época - Gotan Project 1 0.01

5151 rows × 3 columns

Count number of unique users in the dataset


In [8]:
users = song_df['user_id'].unique()

In [9]:
len(users)


Out[9]:
365

Quiz 1. Count the number of unique songs in the dataset


In [10]:
###Fill in the code here
songs = song_df['song'].unique()
len(songs)


Out[10]:
5151

Create a song recommender


In [11]:
train_data, test_data = train_test_split(song_df, test_size = 0.20, random_state=0)
print(train_data.head(5))


                                       user_id             song_id  \
7389  94d5bdc37683950e90c56c9b32721edb5d347600  SOXNZOW12AB017F756   
9275  1012ecfd277b96487ed8357d02fa8326b13696a5  SOXHYVQ12AB0187949   
2995  15415fa2745b344bce958967c346f2a89f792f63  SOOSZAZ12A6D4FADF8   
5316  ffadf9297a99945c0513cd87939d91d8b602936b  SOWDJEJ12A8C1339FE   
356   5a905f000fc1ff3df7ca807d57edb608863db05d  SOAMPRJ12A8AE45F38   

      listen_count                 title  \
7389             2      Half Of My Heart   
9275             1  The Beautiful People   
2995             1     Sanctify Yourself   
5316             4     Heart Cooks Brain   
356             20                 Rorol   

                                                release      artist_name  \
7389                                     Battle Studies       John Mayer   
9275             Antichrist Superstar (Ecopac Explicit)   Marilyn Manson   
2995                             Glittering Prize 81/92     Simple Minds   
5316  Everything Is Nice: The Matador Records 10th A...     Modest Mouse   
356                               Identification Parade  Octopus Project   

      year                                   song  
7389     0          Half Of My Heart - John Mayer  
9275     0  The Beautiful People - Marilyn Manson  
2995  1985       Sanctify Yourself - Simple Minds  
5316  1997       Heart Cooks Brain - Modest Mouse  
356   2002                Rorol - Octopus Project  

Simple popularity-based recommender class (Can be used as a black box)


In [ ]:
#Recommenders.popularity_recommender_py
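
The Recommenders module itself is not listed in this notebook. As a rough, illustrative sketch (not the actual implementation), a popularity-based recommender can be written as follows: count how many distinct users listened to each song in the training data, rank songs by that count, and return the same top-10 list for every user. The class and method names mirror the calls used below; the body is an assumption.

class popularity_recommender_py:
    """Illustrative sketch of a popularity-based recommender (not the original Recommenders.py)."""

    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.popularity_recommendations = None

    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

        #Score each song by the number of distinct users who listened to it
        grouped = (train_data.groupby(item_id)[user_id]
                             .nunique()
                             .reset_index()
                             .rename(columns={user_id: 'score'}))

        #Rank songs by score (highest first) and keep the top 10
        grouped = grouped.sort_values(['score', item_id], ascending=[False, True])
        grouped['Rank'] = range(1, len(grouped) + 1)
        self.popularity_recommendations = grouped.head(10)

    def recommend(self, user_id):
        #Popularity model: every user receives the same top-10 list
        recommendations = self.popularity_recommendations.copy()
        recommendations[self.user_id] = user_id
        return recommendations[[self.user_id, self.item_id, 'score', 'Rank']]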

Create an instance of popularity based recommender class


In [12]:
pm = Recommenders.popularity_recommender_py()
pm.create(train_data, 'user_id', 'song')

Use the popularity model to make some predictions


In [13]:
user_id = users[5]
pm.recommend(user_id)


Out[13]:
user_id song score Rank
3194 4bd88bfb25263a75bbdd467e74018f4ae570e5df Sehr kosmisch - Harmonia 37 1
4083 4bd88bfb25263a75bbdd467e74018f4ae570e5df Undo - Björk 27 2
931 4bd88bfb25263a75bbdd467e74018f4ae570e5df Dog Days Are Over (Radio Edit) - Florence + Th... 24 3
4443 4bd88bfb25263a75bbdd467e74018f4ae570e5df You're The One - Dwight Yoakam 24 4
3034 4bd88bfb25263a75bbdd467e74018f4ae570e5df Revelry - Kings Of Leon 21 5
3189 4bd88bfb25263a75bbdd467e74018f4ae570e5df Secrets - OneRepublic 21 6
4112 4bd88bfb25263a75bbdd467e74018f4ae570e5df Use Somebody - Kings Of Leon 21 7
1207 4bd88bfb25263a75bbdd467e74018f4ae570e5df Fireflies - Charttraxx Karaoke 20 8
1577 4bd88bfb25263a75bbdd467e74018f4ae570e5df Hey_ Soul Sister - Train 19 9
1626 4bd88bfb25263a75bbdd467e74018f4ae570e5df Horn Concerto No. 4 in E flat K495: II. Romanc... 19 10

Quiz 2: Use the popularity-based model to make predictions for the following user id. (Note that the popularity model produces the same recommendations for every user, so only the user_id column changes.)


In [14]:
###Fill in the code here
user_id = users[8]
pm.recommend(user_id)


Out[14]:
user_id song score Rank
3194 9bb911319fbc04f01755814cb5edb21df3d1a336 Sehr kosmisch - Harmonia 37 1
4083 9bb911319fbc04f01755814cb5edb21df3d1a336 Undo - Björk 27 2
931 9bb911319fbc04f01755814cb5edb21df3d1a336 Dog Days Are Over (Radio Edit) - Florence + Th... 24 3
4443 9bb911319fbc04f01755814cb5edb21df3d1a336 You're The One - Dwight Yoakam 24 4
3034 9bb911319fbc04f01755814cb5edb21df3d1a336 Revelry - Kings Of Leon 21 5
3189 9bb911319fbc04f01755814cb5edb21df3d1a336 Secrets - OneRepublic 21 6
4112 9bb911319fbc04f01755814cb5edb21df3d1a336 Use Somebody - Kings Of Leon 21 7
1207 9bb911319fbc04f01755814cb5edb21df3d1a336 Fireflies - Charttraxx Karaoke 20 8
1577 9bb911319fbc04f01755814cb5edb21df3d1a336 Hey_ Soul Sister - Train 19 9
1626 9bb911319fbc04f01755814cb5edb21df3d1a336 Horn Concerto No. 4 in E flat K495: II. Romanc... 19 10

Build a song recommender with personalization

We now create an item similarity based collaborative filtering model that allows us to make personalized recommendations to each user.

Class for an item similarity based personalized recommender system (Can be used as a black box)


In [ ]:
#Recommenders.item_similarity_recommender_py
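
As with the popularity class, the actual item_similarity_recommender_py code is not listed here; the outputs below only show that it builds a co-occurrence matrix and reports its non-zero entries. The sketch that follows is an assumed, simplified version of that idea: it scores each candidate song by its average Jaccard similarity (overlap of listener sets) with the songs the user already listened to, and recommends the highest-scoring candidates. The names match the calls in this notebook, but the details are illustrative, not the real implementation.

import numpy as np
import pandas as pd

class item_similarity_recommender_py:
    """Illustrative sketch of an item-similarity (co-occurrence) recommender (not the original Recommenders.py)."""

    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.item_users = None

    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id
        #Pre-compute, for every song, the set of users who listened to it
        self.item_users = train_data.groupby(item_id)[user_id].apply(set).to_dict()

    def get_user_items(self, user):
        #All songs the given user listened to in the training data
        rows = self.train_data[self.train_data[self.user_id] == user]
        return list(rows[self.item_id].unique())

    def get_similar_items(self, item_list, top_n=10):
        #Average Jaccard similarity between each candidate song and the given songs
        scores = {}
        for candidate, cand_users in self.item_users.items():
            if candidate in item_list:
                continue
            sims = []
            for item in item_list:
                users = self.item_users.get(item, set())
                union = cand_users | users
                sims.append(len(cand_users & users) / len(union) if union else 0.0)
            scores[candidate] = float(np.mean(sims)) if sims else 0.0
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]
        df = pd.DataFrame(ranked, columns=[self.item_id, 'score'])
        df['rank'] = range(1, len(df) + 1)
        return df

    def recommend(self, user):
        #Score every candidate song against the user's listening history
        recs = self.get_similar_items(self.get_user_items(user))
        recs.insert(0, 'user_id', user)
        return recs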

Create an instance of item similarity based recommender class


In [15]:
is_model = Recommenders.item_similarity_recommender_py()
is_model.create(train_data, 'user_id', 'song')

Use the personalized model to make some song recommendations


In [16]:
#Print the songs for the user in training data
user_id = users[5]
user_items = is_model.get_user_items(user_id)
#
print("------------------------------------------------------------------------------------")
print("Training data songs for the user userid: %s:" % user_id)
print("------------------------------------------------------------------------------------")

for user_item in user_items:
    print(user_item)

print("----------------------------------------------------------------------")
print("Recommendation process going on:")
print("----------------------------------------------------------------------")

#Recommend songs for the user using personalized model
is_model.recommend(user_id)


------------------------------------------------------------------------------------
Training data songs for the user userid: 4bd88bfb25263a75bbdd467e74018f4ae570e5df:
------------------------------------------------------------------------------------
Just Lose It - Eminem
Without Me - Eminem
16 Candles - The Crests
Speechless - Lady GaGa
Push It - Salt-N-Pepa
Ghosts 'n' Stuff (Original Instrumental Mix) - Deadmau5
Say My Name - Destiny's Child
My Dad's Gone Crazy - Eminem / Hailie Jade
The Real Slim Shady - Eminem
Somebody To Love - Justin Bieber
Forgive Me - Leona Lewis
Missing You - John Waite
Ya Nada Queda - Kudai
----------------------------------------------------------------------
Recommendation process going on:
----------------------------------------------------------------------
No. of unique songs for the user: 13
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :2097
Out[16]:
user_id song score rank
0 4bd88bfb25263a75bbdd467e74018f4ae570e5df Superman - Eminem / Dina Rae 0.088692 1
1 4bd88bfb25263a75bbdd467e74018f4ae570e5df Mockingbird - Eminem 0.067663 2
2 4bd88bfb25263a75bbdd467e74018f4ae570e5df I'm Back - Eminem 0.065385 3
3 4bd88bfb25263a75bbdd467e74018f4ae570e5df U Smile - Justin Bieber 0.064525 4
4 4bd88bfb25263a75bbdd467e74018f4ae570e5df Here Without You - 3 Doors Down 0.062293 5
5 4bd88bfb25263a75bbdd467e74018f4ae570e5df Hellbound - J-Black & Masta Ace 0.055769 6
6 4bd88bfb25263a75bbdd467e74018f4ae570e5df The Seed (2.0) - The Roots / Cody Chestnutt 0.052564 7
7 4bd88bfb25263a75bbdd467e74018f4ae570e5df I'm The One Who Understands (Edit Version) - War 0.052564 8
8 4bd88bfb25263a75bbdd467e74018f4ae570e5df Falling - Iration 0.052564 9
9 4bd88bfb25263a75bbdd467e74018f4ae570e5df Armed And Ready (2009 Digital Remaster) - The ... 0.052564 10

Quiz 3. Use the personalized model to make recommendations for the following user id. (Note the difference in recommendations from the first user id.)


In [17]:
user_id = users[7]
#Fill in the code here
user_items = is_model.get_user_items(user_id)
#
print("------------------------------------------------------------------------------------")
print("Training data songs for the user userid: %s:" % user_id)
print("------------------------------------------------------------------------------------")

for user_item in user_items:
    print(user_item)

print("----------------------------------------------------------------------")
print("Recommendation process going on:")
print("----------------------------------------------------------------------")

#Recommend songs for the user using personalized model
is_model.recommend(user_id)


------------------------------------------------------------------------------------
Training data songs for the user userid: 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec:
------------------------------------------------------------------------------------
Swallowed In The Sea - Coldplay
Life In Technicolor ii - Coldplay
Life In Technicolor - Coldplay
The Scientist - Coldplay
Trouble - Coldplay
Strawberry Swing - Coldplay
Lost! - Coldplay
Clocks - Coldplay
----------------------------------------------------------------------
Recommendation process going on:
----------------------------------------------------------------------
No. of unique songs for the user: 8
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :3429
Out[17]:
user_id song score rank
0 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec She Just Likes To Fight - Four Tet 0.281579 1
1 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec Warning Sign - Coldplay 0.281579 2
2 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec We Never Change - Coldplay 0.281579 3
3 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec Puppetmad - Puppetmastaz 0.281579 4
4 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec God Put A Smile Upon Your Face - Coldplay 0.281579 5
5 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec Susie Q - Creedence Clearwater Revival 0.281579 6
6 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec The Joker - Fatboy Slim 0.281579 7
7 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec Korg Rhythm Afro - Holy Fuck 0.281579 8
8 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec This Unfolds - Four Tet 0.281579 9
9 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec high fives - Four Tet 0.281579 10

We can also apply the model to find similar songs to any song in the dataset


In [18]:
is_model.get_similar_items(['U Smile - Justin Bieber'])


no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :271
Out[18]:
user_id song score rank
0 Somebody To Love - Justin Bieber 0.428571 1
1 Bad Company - Five Finger Death Punch 0.375000 2
2 Love Me - Justin Bieber 0.333333 3
3 One Time - Justin Bieber 0.333333 4
4 Here Without You - 3 Doors Down 0.333333 5
5 Stuck In The Moment - Justin Bieber 0.333333 6
6 Teach Me How To Dougie - California Swag District 0.333333 7
7 Paper Planes - M.I.A. 0.333333 8
8 Already Gone - Kelly Clarkson 0.333333 9
9 The Funeral (Album Version) - Band Of Horses 0.300000 10

Quiz 4. Use the personalized recommender model to get similar songs for the following song.


In [19]:
song = 'Yellow - Coldplay'
###Fill in the code here
is_model.get_similar_items([song])


no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :969
Out[19]:
user_id song score rank
0 Fix You - Coldplay 0.375000 1
1 Creep (Explicit) - Radiohead 0.291667 2
2 Clocks - Coldplay 0.280000 3
3 Seven Nation Army - The White Stripes 0.250000 4
4 Paper Planes - M.I.A. 0.208333 5
5 Halo - Beyoncé 0.200000 6
6 The Funeral (Album Version) - Band Of Horses 0.181818 7
7 In My Place - Coldplay 0.181818 8
8 Kryptonite - 3 Doors Down 0.166667 9
9 When You Were Young - The Killers 0.166667 10

Quantitative comparison between the models

We now formally compare the popularity and the personalized models using precision-recall curves.
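
For a given user, let R_k be the top-k recommended songs and T the set of songs that user actually has in the test split. The evaluation below assumes the standard definitions of precision and recall at a cutoff k, averaged over the sampled users:

$$\text{precision@}k = \frac{|R_k \cap T|}{k}, \qquad \text{recall@}k = \frac{|R_k \cap T|}{|T|}$$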

Class to calculate precision and recall (This can be used as a black box)


In [20]:
#Evaluation.precision_recall_calculator
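
The Evaluation module is also treated as a black box. Under the definitions above, a simplified sketch of what precision_recall_calculator might do is shown below: sample a fraction of the users that appear in both the training and test splits, ask each model for its top-10 recommendations, and average precision@k and recall@k over the sampled users for k = 1..10. The constructor signature and the return order match the cell below; everything else is an assumption.

import numpy as np

class precision_recall_calculator:
    """Illustrative sketch of a precision/recall evaluator (not the original Evaluation.py)."""

    def __init__(self, test_data, train_data, pm, is_model):
        self.test_data = test_data
        self.train_data = train_data
        self.pm = pm              #popularity model
        self.is_model = is_model  #item-similarity model

    def calculate_measures(self, percentage, k_max=10):
        #Evaluate only users that occur in both the training and the test split
        common_users = np.intersect1d(self.test_data['user_id'].unique(),
                                      self.train_data['user_id'].unique())
        n_sample = max(1, int(len(common_users) * percentage))
        sample = np.random.choice(common_users, n_sample, replace=False)

        pm_precision, pm_recall = np.zeros(k_max), np.zeros(k_max)
        ism_precision, ism_recall = np.zeros(k_max), np.zeros(k_max)

        for user in sample:
            test_songs = set(self.test_data[self.test_data['user_id'] == user]['song'])
            pm_top = list(self.pm.recommend(user)['song'])[:k_max]
            ism_top = list(self.is_model.recommend(user)['song'])[:k_max]
            for k in range(1, k_max + 1):
                pm_hits = len(set(pm_top[:k]) & test_songs)
                ism_hits = len(set(ism_top[:k]) & test_songs)
                pm_precision[k - 1] += pm_hits / k
                ism_precision[k - 1] += ism_hits / k
                if test_songs:
                    pm_recall[k - 1] += pm_hits / len(test_songs)
                    ism_recall[k - 1] += ism_hits / len(test_songs)

        #Average over the sampled users and return the four curves
        return (list(pm_precision / n_sample), list(pm_recall / n_sample),
                list(ism_precision / n_sample), list(ism_recall / n_sample))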

Use the above precision recall calculator class to calculate the evaluation measures


In [20]:
start = time.time()

#Define what percentage of users to use for precision recall calculation
user_sample = 0.05

#Instantiate the precision_recall_calculator class
pr = Evaluation.precision_recall_calculator(test_data, train_data, pm, is_model)

#Call method to calculate precision and recall values
(pm_avg_precision_list, pm_avg_recall_list, ism_avg_precision_list, ism_avg_recall_list) = pr.calculate_measures(user_sample)

end = time.time()
print(end - start)


Length of user_test_and_training:319
Length of user sample:15
Getting recommendations for user:ea3b77e3f9b5688dc3998b2e706ea2c0ca48b8eb
No. of unique songs for the user: 15
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :1742
Getting recommendations for user:78e0065cacc15d6329be91b77045f12ab18cbea5
No. of unique songs for the user: 9
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :215
Getting recommendations for user:5ab56ead71b71022f7043fef70a178b7035629b6
No. of unique songs for the user: 6
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :1896
Getting recommendations for user:881f2e87fe2a45ae27d6e235c156c762ac3cb82a
No. of unique songs for the user: 6
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :1268
Getting recommendations for user:5d5e0142e54c3bb7b69f548c2ee55066c90700eb
No. of unique songs for the user: 31
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :3306
Getting recommendations for user:be0a4b64e9689c46e94b5a9a9c7910ee61aeb16f
No. of unique songs for the user: 76
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :12637
Getting recommendations for user:a8268c552c1122626ba8ab4d7cf2f799de7931b2
No. of unique songs for the user: 23
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :3834
Getting recommendations for user:53ba380d234fd6022818340983570354ee207f6b
No. of unique songs for the user: 10
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :348
Getting recommendations for user:1a849df9dabb15845eb932d46d81e2fd77176786
No. of unique songs for the user: 44
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :6357
Getting recommendations for user:8814f5d1f1d7177aa2efb6de6454504f3bb7b7bc
No. of unique songs for the user: 5
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :741
Getting recommendations for user:d9f2ea75b38f548535caee41d2c0b0e3f9859b1b
No. of unique songs for the user: 8
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :2187
Getting recommendations for user:f608c215606e6421a429ea28ad08243241d5347d
No. of unique songs for the user: 27
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :2286
Getting recommendations for user:a54543f7282b66b3c8423181bf2789e1c7eb2edc
No. of unique songs for the user: 10
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :1014
Getting recommendations for user:ea64e003562d2f0f39e5a7dd84af5b1969e0fea3
No. of unique songs for the user: 10
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :1151
Getting recommendations for user:95b2ebf54cd69d732fa433ee8994be5818793efb
No. of unique songs for the user: 12
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :646
61.7630341053009

Code to plot precision recall curve


In [21]:
import pylab as pl

#Method to generate precision and recall curve
def plot_precision_recall(m1_precision_list, m1_recall_list, m1_label, m2_precision_list, m2_recall_list, m2_label):
    pl.clf()    
    pl.plot(m1_recall_list, m1_precision_list, label=m1_label)
    pl.plot(m2_recall_list, m2_precision_list, label=m2_label)
    pl.xlabel('Recall')
    pl.ylabel('Precision')
    pl.ylim([0.0, 0.20])
    pl.xlim([0.0, 0.20])
    pl.title('Precision-Recall curve')
    #pl.legend(loc="upper right")
    pl.legend(loc=9, bbox_to_anchor=(0.5, -0.2))
    pl.show()

In [22]:
print("Plotting precision recall curves.")

plot_precision_recall(pm_avg_precision_list, pm_avg_recall_list, "popularity_model",
                      ism_avg_precision_list, ism_avg_recall_list, "item_similarity_model")


Plotting precision recall curves.

Generate the precision-recall curve using pickled results on a larger data subset (Python 3)


In [23]:
print("Plotting precision recall curves for a larger subset of data (100,000 rows) (user sample = 0.005).")

#Read the persisted files 
pm_avg_precision_list = joblib.load('pm_avg_precision_list_3.pkl')
pm_avg_recall_list = joblib.load('pm_avg_recall_list_3.pkl')
ism_avg_precision_list = joblib.load('ism_avg_precision_list_3.pkl')
ism_avg_recall_list = joblib.load('ism_avg_recall_list_3.pkl')

print("Plotting precision recall curves.")
plot_precision_recall(pm_avg_precision_list, pm_avg_recall_list, "popularity_model",
                      ism_avg_precision_list, ism_avg_recall_list, "item_similarity_model")


Plotting precision recall curves for a larger subset of data (100,000 rows) (user sample = 0.005).
Plotting precision recall curves.

Generate the precision-recall curve using pickled results on a larger data subset (Python 2.7)


In [24]:
print("Plotting precision recall curves for a larger subset of data (100,000 rows) (user sample = 0.005).")

pm_avg_precision_list = joblib.load('pm_avg_precision_list_2.pkl')
pm_avg_recall_list = joblib.load('pm_avg_recall_list_2.pkl')
ism_avg_precision_list = joblib.load('ism_avg_precision_list_2.pkl')
ism_avg_recall_list = joblib.load('ism_avg_recall_list_2.pkl')

print("Plotting precision recall curves.")
plot_precision_recall(pm_avg_precision_list, pm_avg_recall_list, "popularity_model",
                      ism_avg_precision_list, ism_avg_recall_list, "item_similarity_model")


Plotting precision recall curves for a larger subset of data (100,000 rows) (user sample = 0.005).
Plotting precision recall curves.

The curves show that the personalized model performs much better than the popularity model.

Matrix Factorization based Recommender System

Using an SVD matrix factorization based collaborative filtering recommender system

The following code implements a Singular Value Decomposition (SVD) based matrix factorization collaborative filtering recommender system. The user ratings matrix used is the following small matrix:

        Item0  Item1  Item2  Item3
User0     3      1      2      3
User1     4      3      4      3
User2     3      2      1      5
User3     1      6      5      2
User4     0      0      5      0

As the matrix shows, all users except user 4 rate all items. The code calculates predicted recommendations for user 4.
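
As a brief refresher (standard SVD background, not something computed elsewhere in this notebook), a rank-K SVD approximates the user ratings matrix R as

$$R \approx U_K \, \Sigma_K \, V_K^{T},$$

where the rows of U_K are user vectors, the columns of V_K^T are item vectors, and Σ_K holds the K largest singular values. The estimated preference of user u for item i is read off from the (u, i) entry of this product. The code below follows this idea with K = 2, ranking the items for the test user by the resulting scores.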

Import the required libraries


In [25]:
#Code source written with help from: 
#http://antoinevastel.github.io/machine%20learning/python/2016/02/14/svd-recommender-system.html

import math as mt
import csv
from sparsesvd import sparsesvd #used for matrix factorization
import numpy as np
from scipy.sparse import csc_matrix #used for sparse matrix
from scipy.sparse.linalg import * #used for matrix multiplication

#Note: You may need to install the library sparsesvd. Documentation for 
#sparsesvd method can be found here:
#https://pypi.python.org/pypi/sparsesvd/

Methods to compute SVD and recommendations


In [26]:
#constants defining the dimensions of our User Rating Matrix (URM)
MAX_PID = 4
MAX_UID = 5

#Compute SVD of the user ratings matrix
def computeSVD(urm, K):
    U, s, Vt = sparsesvd(urm, K)

    dim = (len(s), len(s))
    S = np.zeros(dim, dtype=np.float32)
    for i in range(0, len(s)):
        S[i,i] = mt.sqrt(s[i])

    U = csc_matrix(np.transpose(U), dtype=np.float32)
    S = csc_matrix(S, dtype=np.float32)
    Vt = csc_matrix(Vt, dtype=np.float32)
    
    return U, S, Vt

#Compute estimated rating for the test user
def computeEstimatedRatings(urm, U, S, Vt, uTest, K, test):
    rightTerm = S*Vt 

    estimatedRatings = np.zeros(shape=(MAX_UID, MAX_PID), dtype=np.float16)
    for userTest in uTest:
        prod = U[userTest, :]*rightTerm
        #we convert the vector to dense format in order to get the indices 
        #of the movies with the best estimated ratings 
        estimatedRatings[userTest, :] = prod.todense()
        recom = (-estimatedRatings[userTest, :]).argsort()[:250]
    return recom

Use SVD to make predictions for a test user id, say 4


In [27]:
#Used in SVD calculation (number of latent factors)
K=2

#Initialize a sample user rating matrix
urm = np.array([[3, 1, 2, 3],[4, 3, 4, 3],[3, 2, 1, 5], [1, 6, 5, 2], [5, 0,0 , 0]])
urm = csc_matrix(urm, dtype=np.float32)

#Compute SVD of the input user ratings matrix
U, S, Vt = computeSVD(urm, K)

#Test user set as user_id 4
uTest = [4]
print("User id for whom recommendations are needed: %d" % uTest[0])

#Get estimated rating for test user
print("Predictied ratings:")
uTest_recommended_items = computeEstimatedRatings(urm, U, S, Vt, uTest, K, True)
print(uTest_recommended_items)


User id for whom recommendations are needed: 4
Predicted ratings:
[0 3 2 1]

Quiz 5

a.) Change the input matrix row for test user id 4 in the user ratings matrix to the following value, and note the difference in the predicted recommendations:

i.) [5 0 0 0]

(Note: the predicted ratings returned by the code include items the test user has already rated; this has been left in on purpose to make the SVD output easier to interpret.)

SVD tutorial: http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

Understanding Intuition behind SVD

The SVD produces three matrices as output: U, S and Vt (the t in Vt denotes transpose). The matrix U contains the user vectors and the matrix Vt contains the item vectors. In simple terms, U represents the users as 2-dimensional points in the latent vector space, and Vt represents the items as 2-dimensional points in the same space.
Next, we print the matrices U, S and Vt and try to interpret them. Think about how the points for the users and the items would look on a 2-dimensional plot. For example, the following code plots all the user vectors from the matrix U in this 2-dimensional space, and then plots all the item vectors from the matrix Vt in the same figure.

In [28]:
%matplotlib inline
from pylab import *

#Plot all the users
print("Matrix Dimensions for U")
print(U.shape)

for i in range(0, U.shape[0]):
    plot(U[i,0], U[i,1], marker = "*", label="user"+str(i))

for j in range(0, Vt.T.shape[0]):
    plot(Vt.T[j,0], Vt.T[j,1], marker = 'd', label="item"+str(j))    
    
legend(loc="upper right")
title('User vectors in the Latent semantic space')
ylim([-0.7, 0.7])
xlim([-0.7, 0])
show()


Matrix Dimensions for U
(5, 2)